DESIGN OF POLYKETIDE SYNTHASE GENES

CROSS REFERENCE TO RELATED APPLICATIONS 

[1] This application asserts priority to U.S. Provisional Application Nos.
60/237,382, filed October 4, 2000, by inventor Daniel Santi, entitled DEVELOPMENT
AND SCREENING OF A VIRTUAL POLYKETIDE LIBRARY; and 60/207,331, filed
May 30, 2000, by inventors Daniel Santi, Michael Siani, and Chaitan Khosla, entitled
DESIGN OF POLYKETIDE SYNTHASE GENES, all of which are incorporated herein
by reference.

CD APPENDIX 

[2] This disclosure includes a CD appendix. 

FIELD OF INVENTION 

[3] The present invention provides methods for the analysis of polyketides 
and the design of polyketide synthase genes. The invention relates to the fields of 
computational analysis, chemistry, molecular biology, and medicine. 

BACKGROUND OF THE INVENTION 

[4] The class of compounds known as polyketides is a large family of diverse
compounds synthesized primarily from 2-carbon unit building block compounds through
a series of condensations and subsequent modifications. Polyketides occur in many types
of organisms, including fungi and mycelial bacteria such as the actinomycetes. There are
a wide variety of polyketide structures, and the class of polyketides encompasses
numerous compounds with diverse activities. Epothilone, erythromycin, FK-506, FK-520,
megalomicin, narbomycin, oleandomycin, picromycin, rapamycin, spinocyn, and
tylosin are examples of such compounds.

[5] Given the difficulty in producing polyketide compounds by traditional
chemical methodology, and the typically low production of polyketides in wild-type cells,
there has been considerable interest in finding improved or alternate means to produce
polyketide compounds. See PCT Publication Nos. WO 95/08548; WO 96/40968; WO
97/02358; and WO 98/27203; United States Patent Nos. 5,962,290; 5,672,491; and 5,712,146;
Fu et al., Biochemistry 33: 9321-9326 (1994); McDaniel et al., Science 262: 1546-1555
(1993); and Rohr, Angew. Chem. Int. Ed. Engl. 34(8): 881-888 (1995), each of which is
incorporated herein by reference.
[6] Polyketides are synthesized in nature by polyketide synthase (PKS)
enzymes. These enzymes, which are complexes of multiple large proteins, are similar to
the synthases that catalyze condensation of 2-carbon unit building block compounds in
the biosynthesis of fatty acids. The genes that encode PKS enzymes usually consist of
three or more open reading frames (ORFs). Two major types of PKS enzymes are known
that differ in their composition and mode of synthesis. These two major types of PKS
enzymes are commonly referred to as Type I or "modular" and Type II or "iterative" PKS
enzymes.



[7] Modular PKSs produce many different polyketides, including a large
number of 12-, 14-, and 16-membered macrolide antibiotics including erythromycin,
megalomicin, methymycin, narbomycin, oleandomycin, picromycin, and tylosin. Each
ORF of a modular PKS can comprise one, two, or more "modules" of ketosynthase
activity, each module of which consists of at least two (if a loading module) and more
typically three (for the simplest extender module) or more enzymatic activities or
"domains." These large multifunctional enzymes (>300,000 kDa) catalyze the
biosynthesis of polyketide macrolactones through multistep pathways involving
decarboxylative condensations between acyl thioesters followed by cycles of varying
β-carbon processing activities (see O'Hagan, D., The polyketide metabolites, E. Horwood,
New York, 1991, which is incorporated herein by reference).
[8] During the past half decade, the study of modular PKS function and
[...] genes (see Kao et al., Science 265: 509-512 (1994), McDaniel et al., Science 262:
1546-1557 (1993), and U.S. Patent Nos. 5,672,491 and 5,712,146, each of which is
[...]). [...] natural DEBS host organism, Saccharopolyspora erythraea, allows more facile
[...] providing a "clean" host background. This system also expedited construction of a
[...] 98/49315, incorporated herein by reference).

[9] The ability to control aspects of polyketide biosynthesis, such as monomer
selection and degree of β-carbon processing, by genetic manipulation of PKSs has
stimulated great interest in the combinatorial engineering of novel antibiotics (see
[...] Carreras and Santi, [...]: 403-411 (1998); and [...],
each of which is incorporated herein by reference). This interest has resulted in the
cloning, analysis, and manipulation by recombinant DNA technology of genes that
encode PKS enzymes. The resulting technology allows one to manipulate a known PKS
[...] than occur in nature or in hosts that otherwise do [...]. The
technology also allows one to produce molecules that are structurally related to, but
distinct from, the polyketides produced from known PKS gene clusters.

[10] Polyketides are assembled by polyketide synthases through successive
[...] acids such as acetate, propionate, and butyrate. Active sites required for condensation
include an acyltransferase (AT), acyl carrier protein (ACP), and beta-ketoacylsynthase
(KS). Each condensation cycle results in a β-keto group that undergoes all, some, or
none of a series of processing activities. Active sites that perform these reactions include
a ketoreductase [...] any beta-keto processing [...] a KR and DH result in an alkene,
while a KR, DH, and ER [...] glycosylation, oxidation, acylation) to achieve the final
active compound.

[...] 82-63 (1998), incorporated herein by reference), one can refer to the PKS [...] (see
U.S. Patent No. 5,824,513, incorporated herein by reference) [...] condensation and
reduction are [...] extender modules and a chain terminating thioesterase (TE) domain
[...] extremely large polypeptides encoded by three open reading frames (ORFs, designated
eryAI, eryAII, and eryAIII).

[12] Each of the three polypeptide subunits of DEBS (DEBS1, DEBS2, and
[...] and six methylmalonyl CoA [...] domain, following the condensation and
appropriate dehydration and reduction reactions, the enzyme bound intermediate is
[...] by the TE at the end of extender module 6 to form 6-dEB (compound 1 in
Figure 1).



[13] More particularly, the loading module of DEBS consists of two domains,
an acyltransferase (AT) domain and an acyl carrier protein (ACP) domain. In other PKS
[...] an AT and an ACP but instead utilizes a
[...] required for condensation activity. Although the KSQ domain lacks condensation
[...] DEBS) and transfers it to the ACP of that module to form a thioester. Once the PKS is
[...] shown in Figure 1, [...] from the ACP to the KS of the next module, and the
process continues.

[14] The polyketide chain, growing by two carbons each module, is
[...] additional activities in DEBS.
[15] Once a polyketide chain traverses the final extender module of a modular
PKS, it encounters the releasing domain or thioesterase found at the carboxyl end of most
PKSs. Here, the polyketide is cleaved from the enzyme and cyclized. The resulting
polyketide can be modified further by tailoring or modification enzymes; these enzymes
add carbohydrate groups or methyl groups, or make other modifications, e.g., oxidation or
reduction, on the polyketide core molecule. For example, the final steps in conversion of
6-dEB to erythromycin A include the actions of a number of modification enzymes, such
as: C-6 hydroxylation, attachment of mycarose and desosamine sugars, C-12
hydroxylation (which produces erythromycin C), and conversion of mycarose to
cladinose via O-methylation. These modifications in various combinations result in
erythromycins A (compound 2 in Figure 1), B, C, and D.

[16] While the detailed understanding of the mechanisms by which PKS
enzymes function and the development of methods for manipulating PKS genes have
facilitated the creation of novel polyketides, there remain substantial impediments to the
creation of novel polyketides by genetic engineering. One such impediment is the
availability of PKS genes. Many polyketides are known but only a relatively small
portion of the corresponding PKS genes have been cloned and are available for
manipulation. Moreover, in many instances the producing organism for an interesting
polyketide is obtainable only with great difficulty and expense, and techniques for its
growth in the laboratory and production of the polyketide it produces are unknown or
difficult or time-consuming to practice. Also, even if the PKS genes for a desired
polyketide have been cloned, those genes may not serve to drive the level of production
desired in a particular host cell.

[17] If there were a method to produce a desired polyketide without having to
access the genes that encode the PKS that produces the polyketide, then many of these
difficulties could be ameliorated or avoided altogether. The present invention meets this
need.
SUMMARY OF THE INVENTION

[18] In one embodiment, the present invention provides methods for the 
computational analysis of polyketides and the computer-assisted design of PKS genes. 

[19] In a first aspect, the present invention provides a method for representing
the structure of a polyketide and/or a PKS gene that encodes the PKS that produces the
polyketide by alphanumeric symbols that facilitate computer-assisted analysis.

[20] In a second aspect, the present invention provides a database of
polyketides and corresponding PKS genes that can be rapidly searched and information
extracted for a variety of applications. More particularly, this database can include, in
one mode, all known polyketides; and in another mode, the polyketides, optionally
including all intermediates, produced by all known PKS genes or a subset thereof.

[21] In a third aspect, the present invention provides a method for predicting
the structure of a PKS and its corresponding genes from the structure of a polyketide.

[22] In a fourth aspect, the present invention provides a method for designing
novel PKS genes capable of producing a desired polyketide. This aspect of the invention
is directed to the design and specification of PKS genes via the recombining of modules
or portions of modules or sets of modules from already known and available PKS genes.
In one mode, all possible PKS genes encoding a desired polyketide from a set of genes in
a database are generated. In another mode, only a subset of such possible PKS genes is
generated based on one or more parameters selected by the user. More particularly, a
ranking system is provided to sort the PKS genes designed for a particular target polyketide
based on any one or more of several criteria, including number of non-native module
interfaces, number of non-native protein interfaces, and other parameters as more
particularly described below or selected by the user.

[23] In another embodiment, the present invention provides methods and
[...] polyketide.
[24] In a first aspect, the present invention provides a library of recombinant
DNA compounds, wherein each member of said library encodes a module of a PKS or
portions of modules or sets of modules having a desired specificity, and the library as a
whole encompasses all of the members of a desired class of specificities.

[25] In a second aspect, the present invention provides a method for assembling
a PKS gene cluster that encodes a PKS that produces a desired polyketide from known
and available PKS genes other than the naturally occurring PKS genes that produce the
polyketide in nature.

[26] These and other embodiments, modes, and aspects of the invention are
described in more detail in the following description, the examples, and claims set forth
below.



BRIEF DESCRIPTION OF THE FIGURES 

[27] Figure 1 shows a schematic representation of the PKS enzyme that
synthesizes 6-deoxyerythronolide B (6-dEB, compound 1). The PKS is composed of
three proteins, DEBS1, DEBS2, and DEBS3, each of which is represented by an arrow
and contains two or more modules. Each module is represented by a solid line, and the
domains in each module are shown inside the arrow. Various intermediates produced
during the synthesis are also shown, as are the structures of erythromycins A (compound
2), B, and D resulting from modification of 6-dEB.

[28] Figure 2 shows an illustrative set of 2-carbon unit monomers present in
macrocyclic polyketides; these monomers can be used to represent polyketide backbone
diversity generated by commonly used starter and extender units (malonyl CoA and
methylmalonyl CoA) and the condensation and reduction reactions mediated by PKS
enzymes.

[29] Figure 3 shows a representation of 6-dEB by molecular graph,
CHUCKLES notation, and SMILES notation. The CHUCKLES notation uses the
2-carbon unit monomers shown in Figure 2. In the CHUCKLES notation, the order of
attachment of monomers is designated by the order in which monomers are listed, and the
attachment points within the monomers are specified in their definitions. In the SMILES
notation, adjacent monomers are attached via single (covalent) bonds depicted by dashes.
The cyclization bond is represented by the index 1 adjacent to the Start and Close
monomers.

[30] Figure 4 is a flowchart and block flow diagram in five parts designated A- 
E, inclusive. 

[31] Flowchart Figure 4A is a block flow diagram of a computer system to
design a novel PKS (and corresponding genes).

[32] Flowchart Figure 4B is a block flow diagram wherein the "Computer 
Program" block (2) of Flowchart Figure 4A is further defined. 

[33] Flowchart Figure 4C is a block flow diagram wherein the "Design novel
hybrid PKS genes from library for TARGET" block of Flowchart Figure 4B is further
defined.

[34] Flowchart Figure 4D is a block flow diagram wherein the "align TARGET 
with STARTER; copy to ALIGNMENT" block of Flowchart Figure 4C is further 
defined. 

[35] Flowchart Figure 4E is a block flow diagram wherein the "Rate novel
hybrid designs" block (3) of Flowchart Figure 4B is further defined.

[36] Figure 5 shows a flowchart of a matching method for the generation of the
CHUCKLES strings used for all polyketides in a library.
DETAILED DESCRIPTION OF THE INVENTION

[37] Because polyketides synthesized by modular PKS genes are built by the
enzymatically controlled addition of primarily 2-carbon unit monomers and, to a lesser
extent, other more complex monomers, each polyketide may be represented as a string of
2-carbon unit and other monomers. These monomers represent the portion of the
polyketide backbone structure as a result of the incorporation of various starter and
extender units (malonate, methyl malonate, etc.) and the subsequent chemical reactions.

[38] These reactions include: 

(1) condensation reactions, of which there are three basic reactions: malonyl-CoA
condensation and methylmalonyl-CoA condensation with the branched methyl having 
either R or S stereochemistry; and 

(2) reduction reactions, of which there are five basic reactions: no reduction 
(ketone preserved), keto-reduction (to yield a hydroxyl having either R or S 
stereochemistry), dehydration (trans double bond), and enoyl-reduction (to yield a 
methylene). 

[39] An illustrative set of the basic monomers that can be used to represent a 
polyketide structure (and their corresponding symbols) comprises: 
[Monomer structure drawings, each labeled with its symbol (A, B, C, D, ... J, ... L, M,
and N), are not reproduced here.]


A miscellaneous monomer, Q, can be used to denote a portion of the polyketide structure 
that cannot be assigned by monomers A-N. 

[40] The monomer set shown above and in Figure 2 does not represent the 
actual monomers incorporated during biosynthesis. Instead, these monomers include a 
carbon from two different biosynthetic monomers. This is best explained using a 
polyketide fragment depicted below. 




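[Structure drawing of the polyketide fragment, bearing two methyl (CH3) substituents,
is not reproduced here.]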

The fragment includes two two-carbon units, i and i+1, and part of a third two-carbon
unit, i-1, that were incorporated into the polyketide during biosynthesis. The i-th extender
module attaches the two-carbon biosynthetic unit whose backbone carbons are designated
as alpha(i) and beta(i), and the second extender module attaches the two-carbon biosynthetic
unit whose backbone carbons are designated as alpha(i+1) and beta(i+1). Using the monomer
set shown above, this fragment consists of monomer A (derived from the beta carbon
added in module i+1 and the alpha carbon added in module i) and another monomer A
(derived from the beta carbon added in module i and the alpha carbon added in module
i+1).

[Structure drawing of the fragment partitioned into two monomers, each labeled A, is
not reproduced here.]

The fifth carbon, designated beta(i+1), remains unassigned and will depend on the identity of
the two-carbon biosynthetic unit that is incorporated in the polyketide by module i+2.

[41] The set of monomers shown in Figure 2 can be expanded to include other
starter and extender units, of which there are many. Such starter and extender units
include, for example but without limitation, hydroxymalonate (e.g., niddamycin),
methoxymalonate (e.g., FK-520), ethylmalonate (e.g., FK-520), amino acids or amino
acid derivatives that are incorporated into polyketides by the action of a non-ribosomal
peptide synthase (e.g., thiazole in epothilone and pipecolate in rapamycin), or other units
incorporated by, for example, an AMP ligase (e.g., the dihydroxycyclohexyl moiety in
rapamycin, FK-506, and FK-520) or a soluble CoA ligase. An illustrative set of
additional starter and extender units includes:
[Structure drawings of three R-substituted units are not reproduced here.]

where R can be anything other than hydrogen or methyl (e.g., allyl, butyl, ethyl, hexyl,
hydroxyl, isobutyl, and methoxy).

[42] The set of monomers can also include post-PKS modifications, such as
hydroxylation, methylation, epoxidation, glycosylation, or addition of intra-macrocyclic
fused rings making the system polycyclic. Also, a variety of methods are known for the
incorporation of unusual starter and/or extender units in polyketide synthases (see, e.g.,
PCT Publication Nos. WO 97/02358; WO 99/03986; WO 98/01546; and WO 98/01571,
each of which is incorporated herein by reference), and the monomer set can include such
units.



[43] By viewing polyketides as composed of sets of distinct monomers, one
can in accordance with the present invention define a polyketide as a string of
alphanumeric symbols to facilitate computer analysis. In one method, a modified CHUCKLES
methodology for representing polyketides is used. The CHUCKLES methodology (see
Siani et al., "CHUCKLES: a method for representing and searching peptide and peptoid
sequences," J. Chem. Inf. Comput. Sci. 34: 588-593 (1994), which is incorporated herein by
reference) for representing peptides and related oligomers allows monomers to be strung
together such that the molecular graph for the basic macrocycle can be generated from
the string of monomers.

[44] For example, using the set of monomers comprising A-N described above,
the erythromycin macrocycle or 6-dEB can be represented as ADGJDD. This string of
alphanumeric symbols is also referred to as the CHUCKLES string. Figure 3 depicts the
relationship between the CHUCKLES string, the SMILES string, and the actual
molecular structure of 6-dEB. The CHUCKLES string for 6-dEB can be annotated to
represent the structure of erythromycin A:
A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl). Thus, ring
closure (cyclization), post-synthetic modifications (glycosylation and hydroxylation),
and non-standard units where applicable (there are none in 6-dEB and erythromycin)
are entered between parentheses after each monomer. Another example is an annotated
CHUCKLES string for epothilone B:
ME(1-lactone closure)M(epoxide)UDG(2-methylation)E. As above, cyclization,
post-synthetic modifications (epoxide formation), and non-standard units (methyl at C-4)
are entered between parentheses after each monomer.
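
The annotated notation just described can be split mechanically into monomers and their annotations. The following is a minimal sketch, not the program of the invention: it assumes each monomer is a single uppercase letter (optionally followed by a prime), that annotations are a comma-separated list in one pair of parentheses, and that position indices are the digits 1 and 2. The names TOKEN, parse_chuckles, and backbone are hypothetical.

```python
import re

# Hypothetical token: one uppercase monomer symbol, an optional prime,
# and an optional parenthesized, comma-separated annotation list.
TOKEN = re.compile(r"([A-Z]'?)(?:\(([^()]*)\))?")


def parse_chuckles(annotated):
    """Split an annotated string into (monomer, [annotations]) pairs."""
    pairs = []
    pos = 0
    while pos < len(annotated):
        match = TOKEN.match(annotated, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        symbol, notes = match.groups()
        annotations = [n.strip() for n in notes.split(",")] if notes else []
        pairs.append((symbol, annotations))
        pos = match.end()
    return pairs


def backbone(annotated):
    """Return the bare monomer string with all annotations dropped."""
    return "".join(symbol for symbol, _ in parse_chuckles(annotated))


# Annotated erythromycin A string as given in the text.
ery_a = "A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl)"
for monomer, notes in parse_chuckles(ery_a):
    print(monomer, notes)
print(backbone(ery_a))
```

Dropping the annotations from the erythromycin A string recovers the 6-dEB string, which is the relationship between the two compounds that the paragraph describes.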

[45] In another aspect of the present invention, a database of polyketides is
provided. In one aspect of the present invention, the polyketides are represented by a
string of defined monomers. In one embodiment, the monomers are selected from a
group consisting of:
[Monomer structure drawings, each labeled with its symbol (A, B, ... H, J, K, L, M,
and N), are not reproduced here.]




[46] In another embodiment, polyketides are represented by the monomers A-N 
as well as additional monomers selected from the group consisting of



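[Structure drawings of the additional monomers, labeled A', B', D', G', H', J', K', and
M', are not reproduced here.]

or

[Structure drawings of three R-substituted monomers are not reproduced here.]

where R can be anything other than hydrogen or methyl.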
[47] The string of monomers can be represented as a linearized structure or as a
string of symbols. For example, erythromycin can be represented as its aglycone,
6-dEB, as

[Linearized structure drawing of 6-dEB is not reproduced here.]

or as a string of symbols, ADGJDD. Optionally, the string of symbols can be annotated
as "A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl)" to more
fully capture the erythromycin structure. This set of annotated strings is referred to as a
"coded library" or a "coded" database of the present invention.

[48] In an illustrative embodiment, the polyketide database consists of the
polyketides described in current literature (Journal of Antibiotics (1981-present), Journal
of Natural Products) and various databases (Chemical Abstracts CAPlus, AntiBase). All
unique macrocyclic polyketides are converted to the modified CHUCKLES format. Of
the ~1000 novel polyketides obtained, only ~200 different strings of monomers and
unique macrocycles are needed to represent the much larger collection of polyketides in
the database, because many of the differences between the naturally-occurring
polyketides are due to different glycosyl (sugar) groups attached at different positions on
the macrocycle.
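
The reduction from many compounds to fewer unique monomer strings can be illustrated by grouping annotated strings on their bare backbone. This is a sketch under stated assumptions: annotations are taken to be single-level parenthesized groups, the helper names strip_annotations and group_by_backbone are hypothetical, and the third entry below is a made-up variant included only to show the grouping.

```python
import re
from collections import defaultdict


def strip_annotations(annotated):
    """Drop every parenthesized annotation, leaving the bare monomer string."""
    return re.sub(r"\([^()]*\)", "", annotated)


def group_by_backbone(entries):
    """Map each bare monomer string to the names of the compounds sharing it."""
    groups = defaultdict(list)
    for name, annotated in entries:
        groups[strip_annotations(annotated)].append(name)
    return dict(groups)


# Illustrative entries only: the first two strings come from the text; the
# third is a hypothetical analog that differs only in its annotations.
entries = [
    ("6-dEB", "ADGJDD"),
    ("erythromycin A",
     "A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl)"),
    ("hypothetical analog", "A(1-lactone closure)DGJDD(1-glycosyl)"),
]
groups = group_by_backbone(entries)
print(len(entries), "compounds ->", len(groups), "unique backbone string(s)")
```

Compounds that differ only in glycosylation or other annotated modifications collapse onto one backbone string, which is why the coded library can be much smaller than the compound list.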

[49] Thus, a macrocyclic polyketide can be converted to a string of 2-carbon
monomers by mapping the monomers onto the polyketide. This can be performed
manually or with computer assistance. First, any sugar moieties are conceptually
removed by hydrolysis and any lactones (bond between the ketone and oxygen) are
hydrolyzed, thus generating a linearized structure of the backbone of the polyketide.
Generally, this leaves a carboxy carbon at one end of the linear molecule and a hydroxyl
at the other. The polyketide is then "sequenced" manually or in silico from the end
containing the carboxy carbon, the end corresponding to the last monomer added by the
PKS before synthesis is complete. This end serves as a convenient handle from which to
start the mapping process. Although closing of the lactone often occurs between the two
ends of the polyketide, this is not always the case. However, the last ketone added by the
PKS is almost always involved in macrolactone formation and so serves as a more
convenient handle than the hydroxyl for commencing sequencing.

[50] The manual or in silico sequencing is performed by matching the
monomers, one at a time, while traversing the macrocyclic backbone. First the carboxy
carbon is skipped, and an attempt is made to match each of the monomers in the
monomer set selected (e.g., monomer set A-N in Figure 2) against the next two carbons in
the macrocycle. The match takes into account carbon, oxygen, and no substitution at
each backbone position, chirality at each backbone position, and bond order between the
two backbone carbons.

[51] If the sequencing is performed in silico, the method is referred to as
back-translation and involves converting a molecular graph into a string of monomers. First,
the monomer library is converted to SMARTS format. SMARTS is a superset of the
SMILES language that specifies a pattern in a molecular graph (Daylight Software
Manual: Theory; Daylight Chemical Information Systems; Irvine, CA 1993, incorporated
herein by reference). SMARTS permits one to specify a variable number or a limit on
the number of covalent bonds to non-hydrogen atoms from a particular atom. In contrast,
SMILES assumes that the unspecified valences are hydrogens. For example, the
SMILES string for monomer A is [C@@H](O)[C@H](C). The oxygen may be bonded
to any other single atom; if the atom is not specified, it is assumed to be a hydrogen. In the
SMARTS string for monomer A, [C@@H]([O;D2])[C@H]([CH3]), one can specify the
exact number of hydrogens on some atoms (e.g., "CH3"). In addition, the "[O;D2]"
indicates the oxygen is bonded to two (from D2) non-hydrogen atoms, in this case the
first carbon and some other unspecified atom. This allows matching and distinction of
post-modification moieties attached to the oxygen as well as additional cyclizations
(six-member rings can occur within the macrocycle; e.g., [...]). Thus, the SMARTS
notation allows pattern matching against the polyketide molecular graph.

[52] When a match occurs, the atoms that match are tagged as part of a
superset and labeled with the monomer name. Any atoms that are connected to the
monomer that are not part of the macrocycle are tagged for identification as special
precursor units (e.g., ethylmalonate instead of methyl malonate or malonate), or
post-synthetic modification moieties (e.g., sugars, CCHO, hydroxylation, methylation). If all
the atoms and bonds of the monomer cannot be identified, the monomer is given a
designation to indicate the lack of identification (e.g., Q for question mark). These Q
monomers can be used to identify monomers that are the site of post-PKS modifications
that mask the function of the PKS module that generated that portion of the polyketide or
that are not in the monomer set and so prevent the correlation of a particular segment of
the backbone with one of the monomers in the monomer set.

[53] After a particular 2-carbon unit is identified, the next two carbons are
processed the same way. This is repeated until all the backbone carbons are identified
and labeled as monomers. When all two-carbon units are identified, one has generated an
ordered sequence, or string, of monomers, which is a modified CHUCKLES string of the
invention. Moieties corresponding to post-PKS modifications are appended to the
monomer in the string as an annotation in parentheses. This method of sequencing may
be extended to include any type of monomer. Figure 5 shows a flow chart of this
matching method for the generation of the CHUCKLES strings used for all polyketides in
a library.
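
The sequencing loop described above (skip the carboxy carbon, match two backbone carbons at a time, fall back to Q when nothing matches) can be sketched as follows. This is not the SMARTS-based matching of the invention: the backbone is reduced to hand-written descriptors, and the table MONOMER_TABLE with its labels X1 to X3 is entirely hypothetical and does not reproduce the definitions of monomers A-N in Figure 2.

```python
# Hypothetical descriptors: each backbone carbon is (substituent, chirality).
# A 2-carbon unit is (first carbon, second carbon, bond order between them),
# listed in the order the carbons are met when walking from the carboxy end.
MONOMER_TABLE = {
    (("CH3", "S"), ("OH", "R"), 1): "X1",
    (("CH3", "R"), ("O", None), 1): "X2",
    (("H", None), ("H", None), 2): "X3",
}


def sequence_backbone(carbons, bond_orders):
    """Label successive 2-carbon units, starting after the carboxy carbon.

    carbons[0] is the carboxy carbon and is skipped. bond_orders[k] is the
    bond order between carbons[k] and carbons[k + 1].
    """
    labels = []
    for start in range(1, len(carbons) - 1, 2):
        unit = (carbons[start], carbons[start + 1], bond_orders[start])
        labels.append(MONOMER_TABLE.get(unit, "Q"))  # Q marks "no match"
    return labels


carbons = [
    ("O", None),                    # carboxy carbon, skipped
    ("CH3", "S"), ("OH", "R"),      # matches X1
    ("H", None), ("H", None),       # matches X3 (double bond)
    ("CH3", "R"), ("OCH3", None),   # not in the table, so Q
]
bond_orders = [1, 1, 1, 2, 1, 1]
print(sequence_backbone(carbons, bond_orders))
```

A unit labeled Q flags a stretch of backbone that needs manual inspection, for example because a post-PKS modification hides what the module originally produced.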

[54] The CHUCKLES string can be in the order corresponding to the direction
of biosynthesis on the PKS or its reverse. Each CHUCKLES string has a one-to-many
relationship with the PKS gene in the producing organism. Thus, while many different
organisms can produce the same polyketide using the same or different PKS genes, each
PKS gene generally produces only one PKS that produces only one polyketide (some AT
domains can bind different CoAs, leading to the production of multiple polyketides from
a single PKS). This allows one to design, from the polyketide structure, a set of PKS
genes that would produce that polyketide.

[55] Thus, the present invention provides methods and computational analysis
tools for designing PKS genes to produce a desired polyketide. As an illustrative
example, the present invention provides a computer program termed MORPH (see the
Examples below) that can read the coded library (see the Examples below). An
illustrative coded library consists of ~200 unique polyketide CHUCKLES strings. The
user specifies the target polyketide, which is converted from molecular structure to a
CHUCKLES string.

[56] The program then performs the following, starting with each library 
compound or string: 

(1) aligns library compound and target compound, emphasizing alignment of 
adjacent monomers common between the two; 

(2) fills in the gaps using all possible combinations from all library members;

(3) counts the number of non-natural inter-modular boundaries; and

(4) outputs all these alignments.
The alignments are then sorted based on the number of non-natural inter-modular 
boundaries. 
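
Steps (3) and the final sort can be sketched as below. This is an illustration rather than the MORPH program itself: a design is assumed to be a list of (source PKS, module number) pairs in biosynthetic order, a boundary is assumed to be natural only when both modules come from the same source and are consecutively numbered, and the names pksA, pksB, and count_non_natural_boundaries are hypothetical.

```python
def count_non_natural_boundaries(design):
    """Count adjacent module pairs that are not neighbors in a single source PKS.

    design: list of (source_pks, module_number) tuples in biosynthetic order.
    """
    count = 0
    for (src_a, mod_a), (src_b, mod_b) in zip(design, design[1:]):
        if not (src_a == src_b and mod_b == mod_a + 1):
            count += 1
    return count


# Hypothetical candidate designs for one four-module target.
designs = {
    "design_1": [("pksA", 1), ("pksA", 2), ("pksB", 3), ("pksB", 4)],
    "design_2": [("pksA", 1), ("pksB", 2), ("pksA", 3), ("pksB", 4)],
    "design_3": [("pksA", 1), ("pksA", 2), ("pksA", 3), ("pksA", 4)],
}

# Fewest non-natural boundaries first.
ranked = sorted(designs, key=lambda name: count_non_natural_boundaries(designs[name]))
for name in ranked:
    print(name, count_non_natural_boundaries(designs[name]))
```

Under these assumptions a design taken whole from one PKS scores zero, and each switch between sources, or jump within a source, adds one.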

[57] This illustrative program allows one to design and find PKS genes that 
encode PKS enzymes that are combinations of two or more different PKS enzymes with 
the fewest inter-modular boundaries, and optionally the fewest inter-protein boundaries. 
Many other alternative embodiments are provided by the present invention. 

[58] For example, one can include the naturally occurring PKS that produces
the target polyketide in the coded library to allow components of that PKS to be
incorporated into the design of a new PKS. Also, one can include in the coded library
non-naturally occurring PKS enzymes, such as those produced and published in the
scientific and patent literature to make novel polyketides. See, e.g.,
PCT Publication Nos. WO 98/49315 and WO [...], both of which are incorporated
herein by reference.



[59] This CHUCKLES-coded polyketide library can be stored in a computer
[...] CHUCKLES-coded TARGET polyketide [...] from [...] STARTER [...] in the library, i.e.,
[...] as part of the TARGET sequence by the user.

[...] After a STARTER is chosen, the TARGET is aligned [...]

[...] example, if the TARGET contains the JDG substring, then 6-dEB, with the
A1D2G3J4D5D6 CHUCKLES string, may align as
    TARGET | J  D  G
    6-dEB  | J4 D2 G3

or

    TARGET | J  D  G
    6-dEB  | J4 D5 G3.

The two alignments have different maximal runs of adjacent modules, each of
length two (D2G3 in the first and J4D5 in the second). Accordingly, either
alignment could be used as a STARTER.
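
The JDG example can be reproduced by enumerating every way of assigning library modules to the target monomers and scoring each assignment by its longest run of consecutively numbered modules. This is a brute-force sketch, not the alignment procedure of the invention: the function names are hypothetical, and for simplicity the same library module may be assigned to more than one target position.

```python
from itertools import product


def longest_adjacent_run(indices):
    """Length of the longest stretch where each index is the previous one plus 1."""
    best = run = 1
    for prev, cur in zip(indices, indices[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best


def best_alignments(target, library):
    """Return (run_length, alignments) for the alignments with the longest run.

    target and library are bare monomer strings; library modules are numbered
    from 1 in string order, so "ADGJDD" is A1 D2 G3 J4 D5 D6.
    """
    positions = [
        [i + 1 for i, sym in enumerate(library) if sym == t] for t in target
    ]
    if any(not p for p in positions):
        return 0, []  # some target monomer is absent from this library entry
    scored = [(longest_adjacent_run(combo), combo) for combo in product(*positions)]
    top = max(score for score, _ in scored)
    return top, [combo for score, combo in scored if score == top]


run_length, alignments = best_alignments("JDG", "ADGJDD")
for combo in alignments:
    print(" ".join(f"{sym}{idx}" for sym, idx in zip("JDG", combo)))
```

For the JDG target against the 6-dEB string this yields the two alignments shown in the text, J4 D2 G3 and J4 D5 G3, each with a maximal adjacent run of two.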

[62] With the optimized alignment from the STARTER, other library entries
are systematically used to complete the alignment, or fill in the gaps. This part may be
performed on either the optimized ALIGNMENT described above, or the ALIGNMENT
without the single modules from the STARTER; the removal of the individual modules
opens up more space into which larger pieces of the FILLER might be placed. The first
library entry is designated as the FILLER. If the FILLER is the same as the STARTER,
the next library entry is used as the FILLER. This library entry is flagged as the
CURRENT FILLER LIBRARY ENTRY. The same method for finding maximally
adjacent modules and then smaller sets or single modules is used to fill the gaps in
ALIGNMENT from the FILLER. If not all the gaps are filled in the ALIGNMENT, then
the next library entry is used as a new source; that is, it is designated as the FILLER, and
the gaps are filled further. This is repeated until the ALIGNMENT is complete or the
end of the library is reached.

[63] Assuming all modules in the TARGET are represented in the library, the
ALIGNMENT is eventually completely filled. The completed alignment is then written
to an output file on the computer disk. When the ALIGNMENT is complete, or there are
no more FILLERs in the library, the TARGET and STARTER alignment are re-copied to
ALIGNMENT. The CURRENT FILLER LIBRARY ENTRY is incremented, and a
new attempt to fill in the gaps is started.
[64] When the CURRENT FILLER LIBRARY ENTRY has reached the end
of the library, the ALIGNMENT is wiped, and a new STARTER is chosen. The above 
process is then repeated for the next STARTER. When all library entries have been used 
as starters, then all feasible novel polyketide synthases have been generated and written 
to the computer file. The novel PKSs are then read back into memory and can be further 
evaluated. An illustrative evaluation process involves: 

(1) counting the non-native inter-module interfaces, and 

(2) counting the number of native inter-protein interfaces (for known and 
annotated gene sequences). 

The novel PKSs are then sorted based on these two numbers, giving higher priority to the 
non-native inter-module interfaces. In this mode, the goal is to identify those novel PKSs 
that contain the fewest non-native interfaces. 
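The first evaluation step can be sketched as follows. This is a hedged illustration, not the MORPH code: a design is assumed to be a list of (library entry name, module number) pairs in biosynthetic order, an interface is treated as native only when two neighbouring modules are consecutive modules of the same entry, and the names `non_native_interfaces` and `rank` are hypothetical. The second criterion (inter-protein interfaces) needs gene annotation and is omitted here.

```python
def non_native_interfaces(design):
    """Count junctions whose two modules are not neighbours in one library entry."""
    count = 0
    for (name_a, pos_a), (name_b, pos_b) in zip(design, design[1:]):
        if name_a != name_b or pos_b != pos_a + 1:
            count += 1
    return count


def rank(designs):
    """Sort designs so those with the fewest non-native interfaces come first."""
    return sorted(designs, key=non_native_interfaces)


mixed = [("6-dEB", 4), ("6-dEB", 5), ("aldgamycin", 4), ("aldgamycin", 3)]
native = [("6-dEB", 1), ("6-dEB", 2), ("6-dEB", 3)]
print(non_native_interfaces(mixed))   # 2
print(non_native_interfaces(native))  # 0
print(rank([mixed, native])[0] is native)  # True
```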

[65] By providing methods and means for the computer-assisted analysis of 
polyketides and PKS genes, the present invention greatly facilitates the identification and 
production of new polyketides with useful activities. Those of skill in the art will 
appreciate that while the invention is in part illustrated in the Examples below with 
respect to the design of new PKS genes for known polyketides, the invention can also be 
used to design PKS genes for novel polyketides. In this embodiment, one simply 
provides the structure of the novel polyketide to the MORPH or other program of the 
invention to generate the desired PKS genes. 

[66] Moreover, while the invention is exemplified below by designing new
PKS genes composed of the coding sequences for one or more complete modules of two
or more different PKS genes, partial modules can also be employed. With the
appropriate choice of monomer sets and corresponding coding of the library to be
searched, one can generate new PKS gene designs that take advantage of the potential to
fuse one PKS gene coding sequence to another at a site corresponding to an intra-modular
junction. In another embodiment, one can use "wild-cards" in the encoded polyketide or
library to take advantage of known or predicted SAR. Thus, if one knows that a
particular position in a polyketide can be varied (i.e., a hydrogen, methyl, or ethyl group
at a location determined by an AT domain of a particular module, or a hydroxyl or keto
group at a location determined by the presence or absence of a KR domain in a particular
module) then one can use a wild-card monomer designation in the polyketide
CHUCKLES string to generate PKS genes that produce each of the desired variants.
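One way to picture the wild-card idea is to expand a wild-carded CHUCKLES string into every concrete target it stands for. The sketch below assumes that each wild-card letter maps to a set of allowed monomers written as a string; the function name `expand_wildcards` and the example sets are illustrative assumptions, and the real program may handle wild-cards during matching rather than by enumeration.

```python
from itertools import product


def expand_wildcards(target, wildcards):
    """Enumerate every concrete CHUCKLES string covered by a wild-carded target.

    `wildcards` maps a wild-card letter to a string of allowed monomers;
    any other character stands for itself.
    """
    choices = [wildcards.get(ch, ch) for ch in target]
    return ["".join(combo) for combo in product(*choices)]


variants = expand_wildcards("MEXYZDgE", {"X": "JK", "Y": "EF", "Z": "JACGM"})
print(len(variants))  # 20
print(variants[0])    # MEJEJDgE
```

Each of the 2 x 2 x 5 = 20 strings could then be searched against the library as an ordinary target.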

[67] The methods of the invention have diverse application in addition to the 
design of new PKS genes. As but one illustrative example, the methods of the invention 
can be used to design methods to produce a desired compound. Organic molecules 
containing stereochemical centers are useful for a number of purposes, including use as 
synthetic or semi-synthetic intermediates. The preparation of such intermediates by
organic synthesis can be extremely time consuming and expensive. An alternative source
of such intermediates is via specific degradation of a polyketide, and the present 
invention provides computer-assisted means for designing such production methods. 

[68] Thus, certain functional groups of polyketides are susceptible to bond
cleavage by specific chemical reactions that do not affect other functional groups. For
example, carbon-carbon double bonds can be specifically cleaved by permanganate
without affecting other functional groups normally in polyketides, such as ketones,
alcohols, and lactones. Likewise, the Baeyer-Villiger reaction converts a ketone to an
ester (lactone) without affecting other groups of the aglycone. In accordance with the
methods of the invention, one can assemble a library of polyketides in a database that can
be addressed with a query describing a particular chemical reaction to generate all of the
degradation products produced by that reaction upon each of the polyketides in the
library. The degradation fragments thus generated serve as a library of the invention that
can be sorted by properties, such as size, number and type of stereochemical centers,
functional groups, or other factors, and searched for useful compounds. Moreover, the
functional groups on the ends of the fragments generated (or at other locations) can also
be converted to other functional groups by chemical reactions (optionally employing
protecting groups on other functional groups), and the database of compounds can be
expanded to include the compounds produced by such reactions.
[69] From even a modest library of ~200 compounds, one can in this manner
generate, using the methods of the invention, two to three times as many valuable
chemical intermediates. Once such an intermediate is identified, the organism that
produces the polyketide from which the fragment is derived is fermented, the polyketide
isolated in bulk, the chemical reaction performed, and the desired degradation product(s)
isolated and used. In this manner, the present invention makes available a wide variety of
useful products otherwise unattainable.

[70] Thus, the present invention has wide application in the fields of chemistry, 
particularly medicinal chemistry, molecular biology, and medicine. Those of skill in the 
art will recognize these and other benefits and applications provided by the present
invention. Thus, the following examples are given for the purpose of illustrating the
present invention and shall not be construed as being a limitation on the scope of the 
invention or claims. 



EXAMPLE 1 

The MORPH Program

[71] This example provides the source code for an illustrative MORPH
program of the invention. The MORPH program is a command line driven program that
runs on a UNIX system. The program can be run from a shell script, such that the user
fills in the entire command ahead of time, then post-processes the output file with UNIX
utilities including sort, egrep, and uniq.

[72] The command line appears as follows: 

morph3 -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards]
[-z Z-wildcards].

[73] The library file is the name of the text file described below in Example 2.
The target name is a user-defined identifier to distinguish this target from the library
members (e.g., epothiloneD). The target sequence is a string of monomers that represent
the CHUCKLES-encoded target polyketide (e.g., MEMLJDGE). If the target
sequence is in the library, the library entry is commented out from the library so that the
morph program does not find the target itself. The three different wildcards, X, Y, Z, are
independent sets of monomers that can be included in the target sequence.

[74] The output from the morph program can be redirected to a file. This

appear at the top of the output. 

[75] Below are some illustrative examples of running the MORPH program
from a shell script using epothilone as a target. The first example generates combinations
that yield epothilone D:

%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD
%grep HIT oepoD-BOH | sort | uniq | sort +10 -U > oepoD-BOH.uniq.sort
%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgE -x ABCU -y
%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgE -x JK -y EF -z JACGM > oepoD-set2
%grep HIT oepoD-set2 | sort | uniq | sort +10 -U > oepoD-set2.uniq.sort
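The first three stages of the post-processing pipeline (select the HIT lines, sort them, drop duplicates) can be mirrored in a few lines of code. This is a sketch only: it assumes nothing about the MORPH output beyond the presence of the text "HIT" on result lines, the function name `unique_hits` is hypothetical, and the final field-based sort of the shell command is not reproduced.

```python
def unique_hits(lines):
    """Keep lines containing 'HIT', drop duplicates, and return them sorted."""
    return sorted({line.rstrip("\n") for line in lines if "HIT" in line})


sample = ["HIT b\n", "no match here\n", "HIT a\n", "HIT b\n"]
print(unique_hits(sample))  # ['HIT a', 'HIT b']
```

To process a redirected output file, the same function can be applied to the open file object, since iterating over a file yields its lines.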

[76] MORPH in its current implementation operates at the monomer level and
...
and then perform more complex chemical analysis of the PKS molecular graph ...
generally unknown.

[77] The source code for MORPH is found in Appendix A (version 3.0) and B 
(version 4.0) (deposited in the microfiche appendix). 



EXAMPLE 2 



Illustrative Polyketide Library

[78] This example provides the contents of an illustrative CHUCKLES-
encoded polyketide library. The first column provides the name of the polyketide; the
second the CHUCKLES string; the third the annotated CHUCKLES string; and the fourth
the source organism. ...
complete for all of the polyketides.
POLYKETIDE | CHUCKLES | annotated-CHUCKLES | SOURCE ORGANISM

3-acetyl-4"-butyltylosin | FMNGODF | |
#aculeximycin | RRNRSMRSRSSSSRLSSN | RR(2-ethyl)NRSMRS(1-glycosyl)RSSS(2-hydroxyl)SR(1-glycosyl)SSN(2-ethyl) |
albocycline-M1-ingramycin-TA2407-U28010-SR2077 | BLME=JN | BLM(1,2-epoxy)E(1-methoxy)=J(2-hydroxyl)N |
albocycline-M2 | BLME=JN | B(2-hydroxyl)LME(1-methoxy)=J(2-hydroxyl)N |
 | BSME=JL | BSME(1-methoxy)=J(2-hydroxyl)L |
albocycline M5 | BSME=JN | BSME(1-methoxy)=J(2-hydroxyl)N |
albocycline-M6 | BLME=JL | BL(2-hydroxyl)ME(1-methoxy)=J(2-hydroxyl)L |
albocycline-M7 | BLME=JN | BL(2-hydroxyl)ME(1-methoxy)=J(2-hydroxyl)N |
albocycline-M8 | BLME=Q | BLME(1-methoxy)=Q |
aldgamycin | BMLGJDL | B(1-cyc)MLG(2-hydroxyl)JDL(2-cyc) |




amphoten c in A 


V^L^IN IN I^IN IN IN IN r 

CEFEALEE 


^.l/ininlin in in in r ^ i -giycubyi 1 yj 
cyc,2-carboxylicacid)EF( 1 - 
cyc)EALEE 




#amphotericinB 


CDNNNNNNNF 
QQQEELEE 


CDNNNNNNNF( 1 -glycosyl)C( 1 -0- 
cyc,2-carboxylicacid)EF( 1 - 
cyc)EALEE 




angiolam 


IN IVlr J IN o J ll^ljO.fV 






— i : a ~~ 

aplyronmeA 


D r J J_/ 1 N r r i vi c jv 

AFNN 


B(1-C(=0)C)F(1- 

C( =0 )C( C)N(C)C)JCENFFME( 1 - 

methoxy)KAF(l- 

C( =0 )C(N(C)C)COC)NN 


sea hare Aplysia 
kurodai 




EFLMNAMMM 


EF(methoxy-l ,hydroxy-2)LMNAMMM 


aurachinB 


MLMLM 


MLMLM 




aurachinC 


MT MT M 

IVll^lvlLlVl 


MLMLM 




A59770 


QQKQQLJFCDN 


QQK(2-ethyl)Q=QLJ(2- 

hydroxy])F(2-0-glycosyl)CD(2- 

hydroxy])N 


Amycolatopsis 
onentalis 


noZJHort- 

cytovaricin 


QQQKQNLJFDD 
N 






A83543A 


FLFQQQ 


FLFQQQ 


Saccharopolyspora 
spinosa 


AB023a 


NNNNNRSSSLS 
R 


NNNNNRSSSLSR 




AH-758 


RSNMURMN 


RS(2-methoxy)NMURMN(2- 
methoxy) 






Ibnhcenmkcm 

N 




borrelidin 



calyculm 



candicidm- 
candeptin-ascosin- 
levorin-etc 



candidin 



BNHCE(l-macrocyc,2- 
methoxy)NMKCMN(2- 
methoxy)Q(keto-macrocyc) 
BNHCE(l-macrocyc,2- 
methoxy)NMKCMN(2- 
methoxy)Q(keto-macrocyc) 



cdnnnnnnnf" 

CEELIEE 



CDNNNNNNNF( 1 -glycosy 1)C( 1 -0- 
cyc,2-carboxylicacid)EE( 1 -cyc)E(2- 
hydroxyl)LIEE 



QDNKNNNNNF 
CEFELIEE 



carbomycin 



' ENNCODF 



RDNNNNNNNF(1 -glycosyl )C( 1 -0- 
cyc,2-carboxylicacid)EF(l-cyc)E(2- 
hydroxyl)LIEE 



carbomycinB- 
magnamycinB 
carbomycm-A- 
magnamycin- 
deltamycinA4- 
NSC51001-PS97- 
362^WC3628__ 
chalcomycin- 
myconomycin- 
aldgamycinDmiko 
nomycin 



[fnngoff ~~ 



ennhoff" 



S. griseus, S. 
canescus, S. 
levoris, S. 
vindoflavus, Stv. 
grisoviridum 



nyuiuAyi;^^^ 

EN(1 ,2-epoxy)NHOF( 1 -glycosyl,2 

methox)0Fl!^^ - 

FNNGOF(l-glycosyl,2-methoxy)F(l 

CeO)Q 



^)^J 

ENNHO(includes CCHO)FF 



BNNGKDN 



SRRnNNSRQNn 
RMnNn 



cineromycinB 



cineromycinBdehy 



cineromycinB2,3di 



B(2-0-glycosyl)N( 1 ,2-epoxy)NG( 2- 
hydroxyl)KD(l-glycosyl)N 




S. bikiniensis, S. 
albogriseolus 



BLME=JN 



B(0-ethyl)MNC(l -glycosyl )OD(J 

glycosyl)F 

SRR( 1 -O-macrocy c)nNNSR( 1 - 
methoxy)QNnR(l- 
glvcosyl)MnNnO(keto-macrocy c)__ 



S. ambofaciens ka- 
448 

S. cellulosum 



BLME=J(2-hydroxyl)N 



BLMI=JN 



cirramycinB- 
cirramycinBl- 
Acumycin- 
A688A-B58941- 



BLMI=J(2-hydroxyl)N~~ 



cinereochromogene 
s, S. s 



BLME=J(2-hydroxyl)lT 
BM( 1 ,2-epoxy)NGOD(T^y^yOF _ 



amngodf" 



AM(1 ,2-epoxy)NGODOiiycosyl)F" 






cladospolideB 



ELLFn 



cladospolideC 



concanamycinA- 
folimycin-A661-l- 
S45A-TAN1323B- 
X4357B 



ELLEN 



|ELLT(>hydroxyT)^ 
ElIl{2^hydroxyT)^ 




|Ciadosporium 
fulvum, C. 
Jcladosponodes 



cenmkccmn" 



concanamycinB- 
S45B 



concanamycinG- 
anhydroconcanam 
ycmB 



cenmkccmn 



CE(2-methoxy)NMKC(2- 
;thyl)CMN(2-methoxy ) 



Cladosponum 
fulvum, C. 
cladosponodes 
fungus 

Cladosponum 
tenuissimum 



CEC^n^o^ONM^ 
methoxy) 



diastatochromogen 
ese, S. sp, S. 
:va gawaensis 



diastatochromogen 



NBNHCENMKC 
CMN 



NBNHCE(2- 

methoxy)NMKCCMN(2-methoxy) 



|ajm^______ 

Mnrssssssrlr 
I ^rn_ 

bo^cM^O^ 
I N 



A^oxazoynJM^JM] 



t \\ 1 -KJ^U-^^J y » 

RNRSSSS(l-0-cyc)S(2- 
hydroxyJ)SO^yc)RLR^ 



IcytovaricmB 



nyaroxyim i-vj^^ "" — 

QQQKQNLJ( 2-hydroxyl)F(2-0- 
| N £lycojyJ)CD(2Ji^^ 

vv re i W nwnrD(2-hvdroxyl)N 



ARGKDEr 



l damavaricin C_ 
deltamycinAl 



QDQCCNM^ 



ENNGOFF 



^lycosyjlCDT^d^ 

A(2-glycosyl)R(l- 
C(=0)C(0)C(C)C)G(2- 
hydroxyl)KDD(l-glycosyl) 




deltamycinX- 


ENNGOFF 


desisovalerylcarbo 




ImycinA 




lengleromycin 


qnjhn" 


|#epothilone 


MEMUDgE__ 


erythromycin 


ADGJDD 



QDQCCNM . 

" EN( 1 ,2-epoxy)NGOF( 1 -glycosyl,2- 

methoxyjFO^OJO — - 

EN( 1 ,2-epoxy)NGOF( 1 -glycosyl.2- 
methoxy)F(l-C(=0)C) 

lQNJFii2liydroxy^^ 

TSlO]Eio^^ 



rnirpmilL 7 
hago^rriUd^oxy_ 
Rirrpln4agosrri 



^ennnnmf^ 
FennnnnifJfff| 

FFF 



S. filipmensis, S. 
durhamensis 
jsTfilipmensis, S7 
durhamensis 
foromacidinB-








RRUS( 1 -O-macrocy c )NNL(2 
hydroxyl)N( 1 ,2-epoxy)RMMQ(keto- 

macrocyc) 



macrocycj — 

R( 1 -methoxy)N( 1 ,2-epoxy)RNMR( 1 
0-macrocyc)NR(l-C(=0)C,2- 
hy^drox^l)LSQO^ 

cyc,2-carboxylicacid)EF(2- 
cy_c)EEEEIE 



S. hygroscopicus 



S. aureofaciens 



QUU2^h^)RMS^^ 
methoxy)nM_ 



■( l-0-cyc)I(2-cyc,2-hydroxyl)L 



B(2-0-glycosyl)N(l,2-epoxy)LG(2- 
hy^roxyl)W^^y^ 



S. hygroscopicus 
var. gelanus 



Colletotrichum 
gloeosporioides f. 
ussiaea 

1TgerTT55 



QNC(l-methoxy)R(l 

|cXzO]QCBNM__ 

" A(l-methoxy)L(2-methoxy)D( 1 
methoxy)MF(l-CONH2,2 
methoxy)nM 



Mic. halophytica 



CE(1 -O-macrocyc, 2- 
methoxy)NMJCMMQ(keto- 

macrocyc) 

BNHCE( 1 -O-macrocyc ,2- 
methoxy)NMJCMN(2- 

methox^0Q(k5t2^^ 
UR( 1 -O-macrocyc )RNS( 1 - 
glycosyl)LMR{ 1 -0-cyc)=QS( 2- 
U^)OJce^macrocycX____ 
]BMNGFDF(includes an ethyl in pos 

2) 



S. hygroscopicus 




Mic. chalcea 



Mic. chalcea 



iMic. chalcea-Mic. 
capillata 
L681H0



Act. sp, S. lucensis, 
S. glaucus 



macrocin- 
lactenocin 



maridomycinl 



S. aureus, S. lutea, 
K. pneum, B. 
subtl., Shva.. 



QMNGODF 



ENNCOFF 



mandomycm- 
platenomycinC3- 
turimycinEP5- 
B5050A-YL704- 

[mathemycinA 

midecamycmAl- 
platenomycinBl- 
SF837 



lENNCJDF 



S. hygroscopicus- 
S. platensis- 
malvmus 



Irssrssrssrr 
|mrmrnlrl_ 
Ienncoff 



FENNCOF72^methoxy)F 



S. mycarofaciens 
myxovirescinE
myxovirescinFl 



QQEFLNNLLLL 
|jJ 

^QQETLNNLLLK 
J 



rmyxovirescinF2 
fmyxovirescinGl 

^ — ^ Iqqfjlnnllil 

qqeflnnllilj 

J 



myxovirescinHl 



|myxovirescinH2 



P^rescnL ~|pLNNLLIL 

1 |lkj i 

QQEFLNNLLIL 
LJJ 

QQEFLNNLLIL 
KM 



myxovirescinP2 



W^xowescinQ 
QQQNLLKFDD



phenalamid 
IphenalamideAl- 
fenalamid^lO^-C 



nCNNNNM 



JMCMNnNM 



[ptoaTamideA2- JMCMNNNM 

102£L__— — -i-— w 

|ptoa]amiB UMCMNnNM 




protostre ptovancm 

4 ISNNSSSUSRRQ 
RRSS 
S. griseus
[79] In another embodiment, the polyketide library includes the name of the
polyketide, the CHUCKLES string and a linearized representation of the structure. The
linearized representations of the CHUCKLES structures for erythromycin and epothilone
are as follows:

An illustrative example of a polyketide library containing linearized representations of
their structures is found in Appendix C (deposited in the microfiche appendix).

EXAMPLE 3 



Alternative PKS Genes to Epothilone

[80] This example illustrates the alignment and design of novel PKS genes for
the target epothilone. Epothilone is first converted into CHUCKLES string format and
read into the MORPH program as a TARGET. The program then generates all
alignments of library modules and sorts the alignments to determine preferred
combinations of modules for gene construction and production of epothilone via a novel
polyketide synthase gene.
[81] The epothilone D structure above was first opened at the macrolactone
ring closure between the C-1-ketone and the C-15-oxygen. The monomer set shown in
Figure 2 was then matched against each of the successive pairs of macrocyclic backbone
carbon atoms, starting with C-2 and C-3, which match monomer E. The next two carbon
atoms C-4 and C-5 match monomer G with an additional post-synthetic methylation on
C-4. C-6 and C-7 match monomer D. C-8 and C-9 match monomer J. C-10 and C-11
match monomer L. C-12 and C-13 match monomer M. C-14 and C-15, where C-15 has a
hydroxyl substituent (modified by thioesterase to close the macrocycle), match monomer
E. C-16 and C-17 match monomer M.

[82] The rest of the molecule, a methyl-substituted thiazole moiety, does not
match any of the monomers in the monomer set. This moiety corresponds to a malonyl
CoA loading module and an NRPS module that together generate the methyl-substituted
thiazole moiety. This moiety is thus omitted from the CHUCKLES string generated from
this illustrative monomer set but can be added simply by adding a monomer to the set.
The CHUCKLES string generated is EGDJLMEM, which is in the reverse order of
biosynthesis. This sequence is then reversed to MEMLJDGE to yield a monomer
sequence that matches the order of biosynthesis. The sequence is then annotated to
account for the post-synthetic modifications as follows: MEMLJDG(2-methyl)E.
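The encoding steps described for epothilone D can be written out as a small worked example. The carbon-pair assignments are taken from the description above; the variable names and the helper `to_biosynthetic_order` are illustrative only.

```python
# Monomer matched to each successive pair of backbone carbons, read from the
# opened macrolactone (C-2/C-3 first), as described for epothilone D.
assignments = [
    ("C-2", "C-3", "E"),
    ("C-4", "C-5", "G"),
    ("C-6", "C-7", "D"),
    ("C-8", "C-9", "J"),
    ("C-10", "C-11", "L"),
    ("C-12", "C-13", "M"),
    ("C-14", "C-15", "E"),
    ("C-16", "C-17", "M"),
]


def to_biosynthetic_order(chuckles):
    """Reverse a CHUCKLES string read from the ring opening."""
    return chuckles[::-1]


as_read = "".join(monomer for _, _, monomer in assignments)
print(as_read)                         # EGDJLMEM
print(to_biosynthetic_order(as_read))  # MEMLJDGE
```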

[83] This target sequence is provided to the MORPH program to generate all
possible combinations of modules in the CHUCKLES-encoded library that will yield the
target CHUCKLES. The valid combinations are then sorted in increasing order of non-
native inter-module interfaces. In one implementation, a MORPH run generated 3,452
valid sequences of five inter-module interfaces. Of these, none contain fewer than five
inter-module interfaces. Some illustrative sample module combinations appear below.
The combinations are shown listing each monomer followed by a colon and the name of
the polyketide(s) from which it is derived, followed by a parenthetical showing the
associated monomers in that polyketide. Vertical lines represent modular junctions
between two different polyketides.

[84] Illustrative PKS Gene 1 : 

M:3-acetyl-4"-butyryltylosin(FMN) | E:tedanolide(GEH) | M:aldgamycin(BML)
L:aldgamycin(MLG) | J:aldgamycin(GJD) D:aldgamycin(JDL) | G:tedanolide(JGE)
E:tedanolide(GEH)

[85] Illustrative PKS Gene 1 thus comprises one or more open reading frames
that encode, in the order listed, the module from the 3-acetyl-4"-butyryltylosin PKS that
corresponds to monomer M, the module from the tedanolide PKS corresponding to
monomer E, the modules from the aldgamycin PKS corresponding to monomers M, L, J,
and D, and the modules from the tedanolide PKS corresponding to monomers G and E.

[86] Illustrative PKS Gene 2: 

M:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(LME)
E:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(MEJ) |
M:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(LME) |
L:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(BLM) |
J:erythromycin(GJD) D:erythromycin(JDD) | G:tedanolide(JGE) E:tedanolide(GEH)






EXAMPLE 4 

Alternative PKS Genes to 6-Deoxyerythronolide B



[87] This example illustrates the alignment and design of novel PKS genes for
the erythromycin basic polyketide structure (6-dEB) using the MORPH program.

For the 6-dEB structure above, the CHUCKLES string is generated by first opening the
macrolactone ring closure between the C-1-ketone and the C-13-oxygen. Using the
monomer set and matching protocol described in Example 3, one generates the
CHUCKLES string DDJGDA, in the reverse order of biosynthesis. This sequence is then
reversed to ADGJDD to yield the monomer sequence that matches the order of
biosynthesis. The sequence is then annotated to account for the post-synthetic
modifications (erythromycin A) as follows: A(2-hydroxyl)DGJ(2-hydroxyl)D(1-
glycosyl)D(1-glycosyl).

[88] This target sequence is supplied to the MORPH program to generate all
possible combinations of modules in the CHUCKLES-encoded library. The valid
combinations are then sorted in increasing order of non-native inter-module interfaces. In
one implementation, a MORPH run generated 19,63[?] valid sequences of less than or
equal to five inter-module interfaces. Of these, 13,306 contain 4 inter-module interfaces,
and 256 contain only 3 inter-module interfaces. Some of these contain only two inter-
module interfaces, and one contains only one. Some illustrative sample module
combinations follow.

[89] Illustrative PKS Gene 1 : 

A:amphotericinA(EAL) | D:aldgamycin(JDL) | G:mycinamicin(NGJ)
J:mycinamicin(GJD) D:mycinamicin(JDN) | D:amphotericinA(CDN)

[90] Illustrative PKS Gene 1 thus comprises one or more open reading frames
that encode, in the order listed, the amphotericin PKS module corresponding to monomer
A, the aldgamycin PKS module corresponding to monomer D, the mycinamicin PKS
modules corresponding to monomers G, J, and D, and the amphotericin PKS module
corresponding to monomer D.

[91] Illustrative PKS Gene 2 : 

A:amphotericinA(EAL) | D:aldgamycin(JDL) | G:pikromycin(NGJ) J:pikromycin(GJD)
D:pikromycin(JDH) | D:aldgamycin(JDL)

[92] Illustrative PKS Gene 3:

A:lankamycin-kujimycin-landavamycin-A20338N2(-AD) D:lankamycin-kujimycin-
landavamycin-A20338N2(ADG) G:lankamycin-kujimycin-landavamycin-
A20338N2(DGJ) J:lankamycin-kujimycin-landavamycin-A20338N2(GJD) |
D:ossamycin(FDD) D:ossamycin(DDN)

[93] Illustrative PKS Gene 4: 

A:amphotericinA(EAL) | D:lankamycin-kujimycin-landavamycin-A20338N2(ADG)
G:lankamycin-kujimycin-landavamycin-A20338N2(DGJ) J:lankamycin-kujimycin-
landavamycin-A20338N2(GJD) | D:A82548A-cytovaricin(FDD) D:A82548A-
cytovaricin(DDN)
[94] Illustrative PKS Gene 5:

A:lankamycin-kujimycin-landavamycin-A20338N2(-AD) D:lankamycin-kujimycin-
landavamycin-A20338N2(ADG) G:lankamycin-kujimycin-landavamycin-
A20338N2(DGJ) J:lankamycin-kujimycin-landavamycin-A20338N2(GJD)
D:lankamycin-kujimycin-landavamycin-A20338N2(JDD) D:lankamycin-kujimycin-
landavamycin-A20338N2(DD-)

[95] Thus, the present invention provides a useful means to generate new PKS
genes and corresponding enzymes to produce polyketides. The invention having now
been described by way of written description and examples, those of skill in the art will
recognize that the invention can be practiced in a variety of embodiments and that the
foregoing description and examples are for purposes of illustration and not limitation of
the following claims.